Principal Component Analysis (PCA)

The Arabidopsis dataset includes 7561 variants for 249 individuals, of which 2883 are polymorphic SNPs. For the analysis of the population structure, the following packages were initially installed:

#BiocManager::install("gdsfmt")
library("SNPRelate")  ## provide a binary format for SNP data in GWAS utilizing GDS data files
vignette("SNPRelate") ## not found

After setting the working environment, the VCF file (Variant Call Format, a standard format for storing genetic variant data) was loaded and read with the command

vcf <- "C:/Users/Camilla/Documents/UZH.internship/Courses&conferences/Adaptomics/data/data_sweden_sweep_region_chr5_snps_only_fixed.vcf"

The VCF was then converted into the GDS (Genomic Data Structure) format, which stores each SNP genotype in just two bits and offers efficient operations designed specifically for such two-bit integers.
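As a quick sanity check, the "Bit2 249x7561, 459.6K" genotype size reported in the conversion log below follows directly from the two-bit encoding:

```r
# Two bits per genotype: 249 samples x 7561 variants
n_samples  <- 249
n_variants <- 7561
kib <- n_samples * n_variants * 2 / 8 / 1024  # bits -> bytes -> KiB
round(kib, 1)  # 459.6, matching the "Bit2 249x7561, 459.6K" log line
```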

setwd("C:/Users/Camilla/Documents/UZH.internship/Courses&conferences/Adaptomics/ALStructure")
snpgdsVCF2GDS(vcf, "arabidopsis.gds", method = "biallelic.only", ignore.chr.prefix = "lcl|Cp")
## Start file conversion from VCF to SNP GDS ...
## Method: exacting biallelic SNPs
## Number of samples: 249
## Parsing "C:/Users/Camilla/Documents/UZH.internship/Courses&conferences/Adaptomics/data/data_sweden_sweep_region_chr5_snps_only_fixed.vcf" ...
##  import 7561 variants.
## + genotype   { Bit2 249x7561, 459.6K } *
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##     open the file 'arabidopsis.gds' (476.8K)
##     # of fragments: 47
##     save to 'arabidopsis.gds.tmp'
##     rename 'arabidopsis.gds.tmp' (476.5K, reduced: 324B)
##     # of fragments: 20
genofile <- snpgdsOpen("arabidopsis.gds")

The PCA was conducted as an exploratory analysis of the dataset and to identify potential clusters.

arab.pca <- snpgdsPCA(genofile)
## Principal Component Analysis (PCA) on genotypes:
## Excluding 0 SNP on non-autosomes
## Excluding 4,678 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
##     # of samples: 249
##     # of SNPs: 2,883
##     using 1 thread
##     # of principal components: 32
## PCA:    the sum of all selected genotypes (0,1,2) = 1074426
## CPU capabilities: Double-Precision SSE2
## Wed Sep 30 23:33:11 2020    (internal increment: 3008)
## 
[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed, 0s
## Wed Sep 30 23:33:11 2020    Begin (eigenvalues and eigenvectors)
## Wed Sep 30 23:33:11 2020    Done.

As a result, the 2883 polymorphic SNPs were summarised by 32 principal components. The GDS file must then be closed to proceed with the analysis.
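A sketch of the closing step, using the SNPRelate function for GDS connections (the file can be re-opened later with snpgdsOpen() if needed):

```r
snpgdsClose(genofile)  # release the GDS file handle opened above
```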

Visualization of the PCA results: from this first inspection, it seems that PC1 separates the individuals into two clusters, with some other individuals not belonging to either group, possibly outliers. PC2, PC3 and PC4 do not seem to separate the data into groups.

Percentage of variability explained by each component (conversion of data into percentages)

pc.percent <- arab.pca$varprop*100
perc <- round(pc.percent, 2)
perc
##   [1] 10.94  9.45  5.29  5.03  4.56  4.06  3.68  3.25  3.02  2.82  2.72  2.23
##  [13]  2.12  1.93  1.46  1.40  1.29  1.26  1.22  1.21  1.16  1.11  1.00  0.94
##  [25]  0.90  0.83  0.80  0.77  0.76  0.71  0.68  0.63   NaN   NaN   NaN   NaN
##  [37]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
##  [49]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
##  [61]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
##  [73]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
##  [85]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
##  [97]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
## [109]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
## [121]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
## [133]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
## [145]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
## [157]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
## [169]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
## [181]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
## [193]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
## [205]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
## [217]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
## [229]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
## [241]   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN

Together, the first 4 PCs explain about 31% of the total variation (10.94 + 9.45 + 5.29 + 5.03 = 30.71%), with PC1 and PC2 explaining similar percentages of variability.
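The figure can be checked directly from the rounded percentages printed above (values copied here so the check is self-contained):

```r
# First four variance percentages, as printed above
perc4 <- c(10.94, 9.45, 5.29, 5.03)
cumsum(perc4)  # cumulative variance explained by PC1-PC4
sum(perc4)     # about 30.7% in total
```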

It is possible to build a 3D representation of the PCA with the plotly R library, to visualize the first 3 PCs simultaneously.

library("ggplot2")
devtools::install_github("ropensci/plotly")
library("plotly")
arab_pca_df <- data.frame(sample.id = arab.pca$sample.id,   #building the df
                             PC1 = arab.pca$eigenvect[,1],
                             PC2 = arab.pca$eigenvect[,2],
                             PC3 = arab.pca$eigenvect[,3],
                             stringsAsFactors = F)
## ggplot for PC1 and PC2
p1 <- ggplot(arab_pca_df, aes(PC1, PC2, label = sample.id)) +
  geom_point(size = 2) +
  xlab(paste("PC1: ", perc[1], sep = "")) +
  ylab(paste("PC2: ", perc[2], sep = ""))
p1

ggplotly(p1)  
## 2D plot 
plot_ly(arab_pca_df, x = ~PC1, y = ~PC2, text = ~sample.id, size = 1 ) %>%
  layout(xaxis=list(title = paste("PC1: ", perc[1], "%", sep = "")),
         yaxis=list(title = paste("PC2: ", perc[2], "%", sep = "")))
## 3D plot 
plot_ly(arab_pca_df,
        x=~PC1, y=~PC2, z=~PC3,
        text=~sample.id,
        size=1)  %>%
        layout(scene = list(xaxis=list(title=paste("PC1: ", perc[1], "%", sep="")),
                            yaxis=list(title=paste("PC2: ", perc[2], "%", sep="")),
                            zaxis=list(title=paste("PC3: ", perc[3], "%", sep=""))))

ALStructure analysis

The purpose of the ALStructure analysis is to characterise the population structure at a deeper level than the PCA: it provides direct estimates of admixture proportions without requiring assumptions other than HWE and absence of LD among markers within populations. The alstructure package contains functions for fitting the admixture model to a SNP matrix using the efficient ALStructure algorithm described in [@cabreros_storey_2017].

Loading libraries and data

#install.packages("RColorBrewer")
#install.packages("devtools")
#install.packages("gridExtra")
#install_github("storeylab/alstructure", build_vignettes=TRUE)
#vignette("ALStructure_workflow", package="alstructure")
#vignette("SNPRelate")
library("RColorBrewer") # for the visualization of the ALStructure results
library("devtools")
library("SNPRelate")
library("alstructure")
library("vcfR")
library("gridExtra")  ## functions to work with "grid" graphics and draw tables

## in case the previous is not working
#install_github("zhengxwen/gdsfmt")
#install_github("zhengxwen/SNPRelate")
#install_github("storeylab/alstructure", build_vignettes  = T)

The data were loaded and converted into numeric format

vcf.file <- "C:/Users/Camilla/Documents/UZH.internship/Courses&conferences/Adaptomics/data/data_sweden_sweep_region_chr5_snps_only_fixed.vcf"
vcf <- read.vcfR(vcf.file)
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 15
##   header_line: 16
##   variant count: 7561
##   column count: 258
## 
Meta line 15 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 7561
##   Character matrix gt cols: 258
##   skip: 0
##   nrows: 7561
##   row_num: 0
## 
Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant 5000
Processed variant 6000
Processed variant 7000
Processed variant: 7561
## All variants processed
# Converting genotypes (0/0 , 0/1, 1/1) to numeric 
  ## 0 (homozygous for the reference allele), 1 (heterozygous), 2 (homozygous for the alternative allele)
vcf_to_numeric <- function(vcf) {
  require(vcfR)
  out <- apply(extract.gt(vcf), 2, function(x) {
    c(0, 0, 1, 1, 1, 1, 2, 2)[match(x, c("0/0", "0|0", "0/1", "0|1", "1/0", "1|0", "1/1", "1|1"))]
  })
  return(out)
}

vcf_numeric <- vcf_to_numeric(vcf)
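To see how the match()-based recoding behaves, here is a minimal standalone example; note that the lookup vector must have exactly one numeric code per genotype string, and anything absent from the lookup table (e.g. the hypothetical missing genotype "./.") maps to NA:

```r
codes <- c(0, 0, 1, 1, 1, 1, 2, 2)
keys  <- c("0/0", "0|0", "0/1", "0|1", "1/0", "1|0", "1/1", "1|1")
# unmatched genotypes return NA from match(), hence NA in the output
codes[match(c("0/0", "1|0", "1/1", "./."), keys)]
# 0 1 2 NA
```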

Estimating d

Estimate the number of ancestral populations present in the dataset (corresponding to K value)

dhat <- estimate_d(vcf_numeric)
## Error in eigen(Wm): infinite or missing values in 'x'

Missing data prevent the analysis from working properly; they indeed have a strong influence on the inference. They can be replaced with the mean (i.e. the per-SNP mean genotype).

Impute missing data

vcf_numeric_imp <- t(apply(vcf_numeric, 1, function(x) {
  x[is.na(x)] <- mean(x, na.rm=T)
  return(x)
}))
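A minimal standalone illustration of the same row-wise mean imputation on a toy matrix:

```r
m <- matrix(c(0, NA, 2,
              1, 1, NA), nrow = 2, byrow = TRUE)
imp <- t(apply(m, 1, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)  # replace NAs with the row (per-SNP) mean
  x
}))
imp  # row 1: 0 1 2, row 2: 1 1 1
```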

Estimate d again

dhat <- estimate_d(vcf_numeric_imp)
dhat  
## [1] 4

The estimated number of ancestral populations seems to be 4. The ALStructure algorithm can then be run setting d = 4

# Run alstructure with the defined k value
fit <- alstructure(vcf_numeric_imp, d=dhat)

# Write names of the samples to the column names of matrix Q (ancestry proportion of each individual)
colnames(fit$Q_hat) <- colnames(vcf_numeric_imp) 
# Write names of the SNPs to the rows of the matrix P (ancestral allele frequencies)
rownames(fit$P_hat) <- vcf@fix[,"ID"] 
# there is no ID in the VCF file, but this can be useful if that information is available

Plot results of ALStructure analysis

Displaying admixture coefficients is important in order to interpret the outputs of the analysis, and in general to understand the genetic structure of a population in geographic space.

# Function to sort the admixture coefficients
sort.admixQ <- function(Q, thres = 0.5){
  Q <- as.matrix(Q)
  nonadmix <- Q[apply(Q, 1, function(x){
    return(any(x >= thres))
  }),]
  admix <- Q[!apply(Q, 1, function(x){
    return(any(x >= thres))
  }),]
  
# grouping of admix and non admixed individuals 
grouping <- apply(nonadmix, 1, which.max)
  
for(i in 1:ncol(Q)){
  print(i)
  temp <- nonadmix[grouping == i,]
  if(is.null(dim(temp))){
    sort.temp <- temp
  }else{
    sort.temp <- temp[do.call(order,args = list(-temp[,i])),]
  }
  
  if(i == 1){sort.nonadmix <- sort.temp
  }else{
    sort.nonadmix <- rbind(sort.nonadmix, sort.temp)
    if(is.null(dim(temp))){
      rownames(sort.nonadmix)[nrow(sort.nonadmix)] <- rownames(nonadmix)[grouping == i]
    }
  }
}
  
sort.admix <- admix[do.call(order, args = c(as.list(as.data.frame(-1*admix)))),]
return(rbind.data.frame(sort.nonadmix, sort.admix))
}

# sorting the Q matrix (it is necessary to transpose it)
sorted.alsQ <- sort.admixQ(t(fit$Q_hat), thres=0.5)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
# Set the color palette (RColorBrewer library)
myCols <- brewer.pal(4, "Set1")

# barplot
barplot(t(sorted.alsQ ), col=myCols, space = 0.1, border = NA, ylab="Q coefficient",las=2)

The pattern of the ALStructure barplot suggests the presence of 4 groups of individuals with distinct genotypes, i.e. the individuals to which a single colour was assigned. In addition, some individuals may derive from an event of admixture between these potential ancestor populations: for these individuals, part of the genotype can be attributed to the "purple" population and the remainder to the "green" and either the "blue" or the "red" population. However, these results should be interpreted with caution: the algorithm is not based on direct distance metrics between populations and, moreover, it is highly sensitive to missing data (which were present here, but replaced with mean values).

Cross validation

Cross-validation is a resampling method, widely used in machine learning, that helps identify the best value of K.

# Function to calculate cross-validation
cross.valid.alstructure <- function(X, from = 2, to = 10, fold = 5){
  library(alstructure)
  k = seq(from, to, by = 1)
  notmiss <- which(!is.na(X)) # the coordinates of non-missing genotypes
  
  d.mean.square <- c()
  
  for(i in k){
    
    indx <- sample(rep(1:fold, len = length(notmiss)), size = length(notmiss), replace = F)  
    # use index to randomly assign non-missing values into n groups for cross-validation
    d <- c()
    for(n in 1:fold){
      
      tmp.X <- X
      tmp.X[notmiss[indx == n]] <- NA # mask selected data
      tmp.fit <- alstructure(tmp.X, d = i)
      fitted.F <- 2*(tmp.fit$P_hat %*% tmp.fit$Q_hat)
      d <- c(d, X[notmiss[indx == n]] - fitted.F[notmiss[indx == n]])
      
    }# n fold cross validation
    d.mean.square <- c(d.mean.square, mean(d^2))
    
  } # for i loop end
  names(d.mean.square) <- paste0("K", from:to)
  return(d.mean.square)
} 
# end of cross-validation

  # Cross-validation errors 
  # (from how many to how many K we want to test and how many folds should run)
d.mean.square <- cross.valid.alstructure(X = vcf_numeric_imp , from = 2, to = 4, fold = 5)
  
  # Plot cross-validation error in a very basic manner
plot(y = d.mean.square, x = 2:4, xlab = "K",main = "ALStructure, K= 2-4")

The cross-validation error for K = 4 is the lowest, suggesting that the analysis conducted with K = 4 returns the most meaningful structuring of the data. This result is consistent with the PCA: the blue and red groups may correspond to the two most evident clusters, while the remaining individuals may correspond to those more dispersed in the PCA plot.

eSMC analysis

After having defined a plausible number of populations (4) through the structure analysis, it may be interesting to try to infer the past demographic history and self-fertilization rates. The ecological Sequentially Markovian Coalescent (eSMC) method can estimate the changes in population size over past generations. This information can then be used to compare the processes that shaped the different populations.

Installation

Download the tar.gz file from https://github.com/TPPSellinger/eSMC/blob/master/eSMC_1.0.5.tar.gz; information about the vignettes can be found in the repository: https://github.com/TPPSellinger/eSMC

Libraries

# library(devtools) 
# path="/path/to/eSMC_1.0.0.tar.gz" # please put the path to the downloaded file
# devtools::install_local(path)
# library(vcfR)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------- tidyverse 1.3.0 --
## v tibble  3.0.1     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts ----------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::combine() masks gridExtra::combine()
## x dplyr::filter()  masks plotly::filter(), stats::filter()
## x dplyr::lag()     masks stats::lag()
library(eSMC)
library(stringdist)
## 
## Attaching package: 'stringdist'
## The following object is masked from 'package:tidyr':
## 
##     extract

Read and format data

Read the VCF file and then extract the information in the proper format that eSMC can read.

# For the analysis it is required:
## ind_list - a list of the names of the individuals on which you want to perform the analysis
## chr_list - a list of the names of the chromosomes that you want to include in the analysis
## vcf - a vcf file that was read with vcfR
# The final output is a physical file that will then be further imported in the following steps

Preparing the data

The vcf file is already read from the previous analysis

vcf
## ***** Object of Class vcfR *****
## 249 samples
## 1 CHROMs
## 7,561 variants
## Object size: 16 Mb
## 0 percent missing data
## *****        *****         *****
# get sample names
sample_names <- colnames(vcf@gt)[-1]
sample_names ##249 samples
##   [1] "1254" "1257" "1552" "5856" "5860" "6009" "6010" "6011" "6012" "6013"
##  [11] "6016" "6017" "6025" "6030" "6043" "6046" "6064" "6069" "6070" "6071"
##  [21] "6153" "6154" "6163" "6166" "6169" "6172" "6173" "6174" "6177" "6184"
##  [31] "6209" "6210" "6214" "6216" "6217" "6218" "6220" "6221" "6231" "6235"
##  [41] "6237" "6238" "6240" "6241" "6244" "6900" "6901" "6913" "6917" "6918"
##  [51] "6968" "6969" "8227" "8351" "8376" "9321" "9323" "9332" "9363" "9371"
##  [61] "9386" "9388" "9427" "9433" "991"  "992"  "997"  "1002" "1006" "1061"
##  [71] "1062" "1063" "1066" "1070" "1158" "1166" "5830" "5831" "5832" "5836"
##  [81] "5865" "5867" "6019" "6020" "6021" "6022" "6023" "6024" "6034" "6035"
##  [91] "6036" "6038" "6040" "6041" "6073" "6074" "6076" "6077" "6085" "6087"
## [101] "6088" "6090" "6091" "6092" "6094" "6095" "6096" "6097" "6098" "6099"
## [111] "6100" "6101" "6102" "6104" "6105" "6106" "6107" "6109" "6111" "6112"
## [121] "6113" "6114" "6115" "6118" "6119" "6122" "6123" "6124" "6125" "6126"
## [131] "6128" "6131" "6132" "6133" "6134" "6136" "6137" "6138" "6140" "6141"
## [141] "6142" "6145" "6148" "6149" "6150" "6151" "6188" "6189" "6191" "6192"
## [151] "6193" "6194" "6195" "6198" "6201" "6202" "6203" "6242" "6284" "6413"
## [161] "6974" "7013" "7164" "7343" "7346" "7354" "8222" "8230" "8231" "8234"
## [171] "8237" "8240" "8241" "8242" "8247" "8249" "8258" "8259" "8283" "8306"
## [181] "8307" "8326" "8335" "8369" "8422" "8426" "8427" "9057" "9339" "9352"
## [191] "9353" "9369" "9370" "9380" "9390" "9391" "9392" "9395" "9399" "9402"
## [201] "9404" "9405" "9407" "9408" "9409" "9412" "9413" "9416" "9421" "9436"
## [211] "9437" "9450" "9451" "9452" "9453" "9454" "9455" "9470" "9471" "9481"
## [221] "1313" "1317" "6039" "6042" "6086" "6258" "8256" "8334" "9058" "9336"
## [231] "9343" "9381" "9382" "9383" "9394" "6243" "8386" "6180" "6252" "6255"
## [241] "6268" "6276" "6973" "7516" "7517" "8387" "9442" "9476" "6108"
# get the chromosome names
chrom_names <- unique(vcf@fix[,"CHROM"])
chrom_names  ## only 1 chromosome in arabidopsis database
## [1] "5"
# Random subset chromosomes and samples (the complete dataset would take too long)
N_CHR_KEEP = 1  # 1 chromosome in the dataset
N_IND_KEEP = 10 # keeping 10 samples

chr_list <- sample(chrom_names, N_CHR_KEEP) # list of chromosomes 
ind_list <- sample(sample_names,N_IND_KEEP) # list of samples 

chr_list  
## [1] "5"
ind_list 
##  [1] "8334" "1070" "6106" "6148" "6070" "9332" "5836" "9343" "6918" "6100"

Creating an external file with the data formatted as eSMC requires. For diploid organisms, such as Arabidopsis, one file per chromosome is created, containing the data for all samples. New parameters:

only_var <- TRUE    # keep only variant positions
kickout_less100SNPs <- TRUE # at least 100 heterozygous positions
is_homozygous <- TRUE # true if the vcf file is full diploid homozygous (no variants within individual)

# if is_homozygous, then the name of the file will have all samples being used written
if (is_homozygous) {
  all_names <- paste(ind_list, collapse ="-")
  all_names # print the resulting file
}
## [1] "8334-1070-6106-6148-6070-9332-5836-9343-6918-6100"

All individuals turn out to be completely homozygous (diploid). A double loop follows to create the files (one per chromosome with all the individuals; in this case, only 1 file):

for (i in 1:length(ind_list)){ # for each individual in the list
   
   ind <- ind_list[i] # sample name
   
   if (is_homozygous) {
     only_var <- FALSE # otherwise it would remove the complete individual!!!
   }
   
   for (chr in chr_list){ # ..for each chromosome in the list
    
    # Get SNP positions of the chromosome `chr`
    SNP_pick <- which(vcf@fix[,"CHROM"]==chr) 
    
    # Get the genotypes for that chromosome (selected positions `SNP_pick`) and invidual `ind`
    vcf_single <- vcf[SNP_pick, c("FORMAT", ind) ] 
    
    # Transform data to tidyverse format
    haptemp <- vcfR2tidy(vcf_single, single_frame = TRUE)
   
    # Build input file for eSMC: 3 rows - Two haplotypes + positions
    hapfile <- as.data.frame(strsplit(haptemp$dat$gt_GT_alleles,"/")) # split genotypes
    colnames(hapfile) <- NULL
    hapfile <- as.matrix(hapfile)
    hapfile <- rbind(hapfile, haptemp$dat$POS) # add position number
    
    # Kick out homozygous sites (if only_var = TRUE)
    if (only_var){keep1 <- stringdist(hapfile[1,], hapfile[2,], "hamming")
                  hapfile <- hapfile[,as.logical(keep1)]}
    
    # Only save output if at least 100 heterozygous positions (if kickout_less100SNPs = TRUE)
    if (kickout_less100SNPs){if (ncol(hapfile)<100){next}}
    
    chr_num <- str_replace(chr, "([|])", "") # Special fix for Windows and MacOS users
    
    if (!is_homozygous){  
      
      if (!only_var){
        write.table(hapfile,
                    file=paste0("inputSMC_", ind, 
                                        "_", chr_num,".txt"),
                    quote = FALSE, row.names = FALSE, col.names = FALSE)
      } else {
        write.table(hapfile,
                    file=paste0("inputSMC_onlyvar_", ind,
                                         "_", chr_num,".txt"),
                    quote = FALSE, row.names = FALSE, col.names = FALSE)              
            }
      } else { # Will produce 1 file with all sample haplotypes (if is_homozygous = TRUE)
        
        if (i == 1) { # if it is the first, make sure it does not append (creates the file again)
          write.table(t(as.matrix(hapfile[1,])),
                      file=paste0("inputSMC_onlyvar_", all_names,  
                                           "_", chr_num,".txt"),
                      quote = FALSE, row.names = FALSE, col.names = FALSE, append = FALSE)   
      
        } else if (i == length(ind_list)) { # if it is the last, append positions as well
          write.table(as.matrix(hapfile[c(1,3),]),
                        file=paste0("inputSMC_onlyvar_", all_names,
                                             "_", chr_num,".txt"),
                        quote = FALSE, row.names = FALSE, col.names = FALSE, append = TRUE)   
        
        }else {
          write.table(t(as.matrix(hapfile[1,])),
                        file=paste0("inputSMC_onlyvar_", all_names,
                                             "_", chr_num,".txt"),
                        quote = FALSE, row.names = FALSE, col.names = FALSE, append = TRUE)   
        }
      }
   }
}
## Extracting gt element GT
## Extracting gt element GQ
## Extracting gt element DP
## Extracting gt element GT
## Extracting gt element GQ
## Extracting gt element DP
## Extracting gt element GT
## Extracting gt element GQ
## Extracting gt element DP
## Extracting gt element GT
## Extracting gt element GQ
## Extracting gt element DP
## Extracting gt element GT
## Extracting gt element GQ
## Extracting gt element DP
## Extracting gt element GT
## Extracting gt element GQ
## Extracting gt element DP
## Extracting gt element GT
## Extracting gt element GQ
## Extracting gt element DP
## Extracting gt element GT
## Extracting gt element GQ
## Extracting gt element DP
## Extracting gt element GT
## Extracting gt element GQ
## Extracting gt element DP
## Extracting gt element GT
## Extracting gt element GQ
## Extracting gt element DP

Set eSMC parameters

To run eSMC, several parameters must be set:

mu = 6.95*10^-9      # Mutation rate
r = 3.6*10^-8        # Recombination rate
rho = r / mu         # ratio recombination/mutation 

ER = FALSE           # estimate recombination rate
SF = TRUE            # estimate selfing rate 
SB = FALSE           # estimate germination rate

# Set boundaries for germination and selfing rate
BoxB = c(0.05, 1)    # min-max values of germination rate 
Boxs = c(0.5, 0.99)  # min-max value of selfing rate

# Number of hidden states in the esmc call 
N=30                 # 30 should be the min. for reliable results

Run eSMC

Loop over each sample, running eSMC with the specified list of chromosomes (only chromosome 5 in this case). For diploid-homozygous data, only one run is required, since all samples are included in a single file. The function below also checks whether only one file is given and whether the file exists.

# loading the function to run esmc in one sample
calc_esmc <- function(data_files_names, n) {
  #
  # This function will read the listed files, put the information 
  # ...to a matrix and use the function eSMC with
  # ...the parameters that were set before (BoxB, Boxs) and so on
  #
  # Arguments:
  #           data_files_names: list of the files with the formatted data for esmc
  #           n: number of hidden states to run (minimum should be 30)
  # Returns:
  #           result: single result from eSMC (or NULL if something went wrong)
  #
  # Future improvements: 
  #           pass also all esmc arguments through the function
  #
  
  data_files <- vector("list", length(data_files_names))
  
  for (j in 1:length(data_files)) {
    input_name <- data_files_names[j]
    # first check that the file exists
    if(!file.exists(input_name)) {
      cat("Problem! File '", input_name, "' does not exist. Will exit the function.", sep="")
      return(NULL) # exit immediately (a plain `break` would leave NC at the full list length below)
    }
    data_files[[j]] <- as.matrix(read.table(input_name))
  }
  
  NC <- length(data_files) # how many chromosomes
  
  if (NC > 1) { # NC > 1 chromosome
    
    result <- eSMC(n=n, rho = rho, data_files, 
                   BoxB = BoxB, Boxs = Boxs, 
                   SB = SB, SF = SF, 
                   Rho = ER, Check = F, 
                   NC = NC)
    
  } else if (NC == 1){ # NC == 1 chromosome

    result <- eSMC(n=n, rho = rho, data_files[[1]], 
                   BoxB = BoxB, Boxs = Boxs, 
                   SB = SB, SF = SF,
                   Rho = ER, Check = F, 
                   NC = NC)
    
  } else { # NC == 0 just return NULL (maybe something went wrong reading the data)
    
    result <- NULL
    
  }
  return(result)
}

Now loop over the individuals and call the function from the previous chunk

chr_nums <- str_replace(chr_list, "([|])", "") # special fix for Windows and MacOS users
  
if (!(is_homozygous)) { 
  # I loop over the list of individuals, I call the function to get
  # .. the esmc result and put it into a list called `results`
  
  results <- vector("list", length = length(ind_list))
  names(results) <- ind_list
  
  for (i in 1:length(ind_list)) {
    ind <- ind_list[i]
    
    if (only_var) {
      data_files_names <- paste("inputSMC_onlyvar_", ind, 
                                "_", chr_nums,  ".txt", sep="") # get the list of files
    } else {
      data_files_names <- paste("inputSMC_", ind, 
                                "_", chr_nums,  ".txt", sep="") # get the list of files
    }
    
    result <- calc_esmc(data_files_names, N)
    # save the file (it can be another name)
    save(result, file=paste("result_", ind, ".RData", sep=""))
    # add the result to the list
    results[[i]] <- result
  } 
  
} else { # if it is a diploid-homozygous. 
  # Just use the file with all stacked individuals
  # ... and call the esmc function to get
  # ... the esmc result (single result, not listed!)

  data_files_names <- paste("inputSMC_onlyvar_", all_names, 
                           "_", chr_num,".txt", sep="") # get the list of files
  
  result <- calc_esmc(data_files_names, N) 
  
  # save the file (it can be another name)
  save(result, file=paste("result_", all_names, ".RData", sep=""))
}
## [1] "theta estimation"
## [1] 1141.489
## [1] "sequence length : 6527989"
## [1] "new mu"
## [1] 0.0001165738
## [1] "new Rho"
## [1] 1521.985
## [1] 1
## [1] "Beta: 1"
## [1] "Self: 0.5"
## [1] "It: 1"
## [1] "rho/theta:" "1"         
## Time difference of 1.692918 mins
## [1] " New Likelihood:  -374377.605803013"
## iter:  0  f-value:  67.42674  pgrad:  0.5 
## iter:  10  f-value:  44.80974  pgrad:  0.3207595 
## iter:  20  f-value:  44.77221  pgrad:  0.01073033 
## iter:  30  f-value:  44.77221  pgrad:  0.004040075 
##   Unsuccessful convergence.
## Time difference of 1.825765 mins
## [1] "It: 2"
## [1] "rho/theta:"       "0.21056418958232"
## Time difference of 1.653693 mins
## [1] " old Likelihood:  -374377.605803013"
## [1] " New Likelihood:  -314185.791580241"
## iter:  0  f-value:  43.56581  pgrad:  0.4527029 
## iter:  10  f-value:  43.37551  pgrad:  0.1764859 
## iter:  20  f-value:  43.34429  pgrad:  0.07681493 
## iter:  30  f-value:  43.33942  pgrad:  0.05407124 
##   Unsuccessful convergence.
## Time difference of 1.787824 mins
## [1] "It: 3"
## [1] "rho/theta:"        "0.260097270758251"
## Time difference of 1.733105 mins
## [1] " old Likelihood:  -314185.791580241"
## [1] " New Likelihood:  -310133.862157974"
## iter:  0  f-value:  49.62631  pgrad:  0.7924321 
## iter:  10  f-value:  49.5124  pgrad:  0.08722267 
## iter:  20  f-value:  49.5032  pgrad:  0.055743 
## iter:  30  f-value:  49.49891  pgrad:  0.04147381 
##   Unsuccessful convergence.
## Time difference of 1.873486 mins
## [1] "It: 4"
## [1] "rho/theta:"        "0.346926502968373"
## Time difference of 1.732786 mins
## [1] " old Likelihood:  -310133.862157974"
## [1] " New Likelihood:  -315751.126572608"
## Time difference of 1.771924 mins
## [1] "no need  for more iteration"
## [1] "estimation of r/mu"
## [1] 0.2600973
## [1] "effect"
## [1] 0.05021322
## [1] "sequence length : 6527989"
## [1] "new mu"
## [1] 0.0001704706
## [1] "new Rho"
## [1] 11528.6
## [1] 5.179856
## [1] "Beta: 1"
## [1] "Self: 0.974246813190171"
## [1] "It: 1"
## [1] "rho/theta:"       "5.17985611510791"
## Time difference of 1.635068 mins
## [1] " New Likelihood:  -373663.11791319"
## iter:  0  f-value:  35.9458  pgrad:  0.5 
## iter:  10  f-value:  26.2596  pgrad:  0.02543022 
## iter:  20  f-value:  27.12313  pgrad:  0.9650856 
## iter:  30  f-value:  26.54907  pgrad:  0.2472721 
##   Unsuccessful convergence.
## Time difference of 1.766802 mins
## [1] "It: 2"
## [1] "rho/theta:"       "5.17985611510791"
## Time difference of 1.661807 mins
## [1] " old Likelihood:  -373663.11791319"
## [1] " New Likelihood:  -283892.73559006"
## iter:  0  f-value:  143.6593  pgrad:  0.7661259 
## iter:  10  f-value:  137.1816  pgrad:  0.5672276 
## iter:  20  f-value:  137.1372  pgrad:  0.362848 
## iter:  30  f-value:  137.1354  pgrad:  0.0835621 
##   Unsuccessful convergence.
## Time difference of 1.791067 mins
## [1] "It: 3"
## [1] "rho/theta:"       "5.17985611510791"
## Time difference of 1.667793 mins
## [1] " old Likelihood:  -283892.73559006"
## [1] " New Likelihood:  -359690.806716759"
## Time difference of 1.708495 mins

Plot of the results

Use the x = c(...) and y = c(...) arguments to set the ranges of the x- and y-axis

# single result
Plot_esmc_results(result, mu, WP=F, LIST=F, check=F, 
                  x = c(10^3, 10^5), y = c(3, 6)) # fit x and y according to your data!!

The results show an overall increase in population size, with a marked expansion around 1500 generations ago.

PCAdapt

PCAdapt is a method to identify variants that differ strongly between populations, allowing one to test for selection driven by external factors, such as different environmental pressures. It is based on PCA, and should therefore be used after the structure analysis in order to interpret possible outlier variants (SNPs).

Libraries

#install.packages("pcadapt")
library(pcadapt)
library(ggplot2)

Loading genotype data

The data is in VCF format (the same used previously)

arab.vcf <- "C:/Users/Camilla/Documents/UZH.internship/Courses&conferences/Adaptomics/data/data_sweden_sweep_region_chr5_snps_only_fixed.vcf"
pcad <- read.pcadapt(arab.vcf, type = "vcf")
## Warning in file2other(input, type, match.arg(type.out), match.arg(allele.sep)):
## Converter vcf to pcadapt is deprecated. Please use PLINK for conversion to bed
## (and QC).
## No variant got discarded.
## Summary:
## 
##  - input file:               C:/Users/Camilla/Documents/UZH.internship/Courses&conferences/Adaptomics/data/data_sweden_sweep_region_chr5_snps_only_fixed.vcf
##  - output file:              C:\Users\Camilla\AppData\Local\Temp\RtmpuMCT5n\file218066da61ff.pcadapt
## 
##  - number of individuals detected:   249
##  - number of loci detected:      7561
## 
## 7561 lines detected.
## 249 columns detected.
pcad
## [1] "C:\\Users\\Camilla\\AppData\\Local\\Temp\\RtmpuMCT5n\\file218066da61ff.pcadapt.bed"
## attr(,"n")
## [1] 249
## attr(,"p")
## [1] 7561
## attr(,"class")
## [1] "pcadapt_bed"

Choosing the number of K principal components

First, PCA is performed on the centered and scaled genotype matrix. The second stage consists of computing test statistics and p-values based on the correlations between SNPs and the first K principal components (PCs). To run the function pcadapt, the user should specify the output returned by the function read.pcadapt and the number K of principal components to compute. To choose K, the principal component analysis should first be performed with a large enough number of principal components (e.g. K = 20).

# K=20 and scree plot 
x <- pcadapt(input = pcad, K = 20, ploidy = 2)
plot(x, option = "screeplot")

# K=10 and scree plot
x <- pcadapt(input = pcad, K = 10, ploidy = 2)
plot(x, option = "screeplot")

# K=5 and scree plot
x <- pcadapt(input = pcad, K = 5, ploidy = 2)
plot(x, option = "screeplot")

K = 4 seems to be a suitable number of principal components.

Score Plot

Another option to choose the number of PCs is based on the 'score plot' that displays population structure. The score plot displays the projections of the individuals onto the specified principal components. Using the score plot, the choice of K can be limited to the values of K that correspond to a relevant level of population structure.

Results of the PCA: the individuals seem to be clustered into 3 or 4 populations along PC1. For PC2, PC3 and PC4, the individuals do not seem to be clearly separated into groups; overall, there is a continuous distribution of the loading weights, with only small subgroups.

Compute the test on K = 4

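The code chunk for this step is not shown above; a minimal sketch, assuming the object name pcadapt.kx used in the Bonferroni chunk further below:

```r
# fit the pcadapt test statistics with K = 4 principal components
pcadapt.kx <- pcadapt(input = pcad, K = 4, ploidy = 2)
summary(pcadapt.kx)
```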
Plot

Manhattan plot (no outlier threshold), to display significant SNPs. There is a peak of significant p-values around position 5000; these may be relevant variants for the separation of the samples into groups along PC1.
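A sketch of the call producing this plot, assuming the pcadapt.kx object from the K = 4 fit ("manhattan" is one of pcadapt's standard plot options):

```r
plot(pcadapt.kx, option = "manhattan")
```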

QQ plot and histogram of p-values. There is a peak in the distribution of the p-values around 0, which may correspond to outlier loci.
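Sketches of the corresponding calls, again assuming the pcadapt.kx object from the K = 4 fit:

```r
plot(pcadapt.kx, option = "qqplot")                     # QQ plot of p-values
hist(pcadapt.kx$pvalues, breaks = 50, col = "grey",     # p-value distribution
     xlab = "p-values", main = "")
```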

Choose the cutoff

Using the Bonferroni correction

# Find significant outliers
padj.kx <- p.adjust(pcadapt.kx$pvalues, method="bonferroni")
alpha <- 0.01
outliers.kx <- which(padj.kx < alpha)
length(outliers.kx)     ## 82 outliers
## [1] 82
# Set red color for outliers
padj.kx.rmna <- padj.kx[!is.na(padj.kx)]
sig_col <- rep("black", length(padj.kx.rmna))
sig_col[which(padj.kx.rmna < alpha)] <- "red"

Manhattan plot with outliers. The plot shows in red the 82 outliers identified. It could be interesting to further investigate possible associations between these variants and the PCs.
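A base-R sketch of such a plot, using the padj.kx.rmna, sig_col and alpha objects built in the previous chunk:

```r
plot(-log10(padj.kx.rmna), col = sig_col, pch = 19, cex = 0.6,
     xlab = "SNP index", ylab = "-log10(adjusted p-value)")
abline(h = -log10(alpha), lty = 2)  # Bonferroni cutoff at alpha = 0.01
```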

Association between PCs and outliers

Associate the significant SNPs to the Principal Components

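pcadapt provides get.pc() for this; a sketch producing the table below, assuming the pcadapt.kx fit and the outliers.kx indices from the previous chunks:

```r
# for each outlier SNP, report the PC it is most correlated with
snp_pc <- get.pc(pcadapt.kx, outliers.kx)
head(snp_pc)
```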
##    SNP PC
## 1 5239  1
## 2 5260  1
## 3 5268  1
## 4 5292  1
## 5 5295  1
## 6 5296  1

Further analyses should be conducted on this list of candidate SNPs, to link these variants to possible phenotypic traits resulting from adaptation to different environmental conditions. Combined, these methods provide useful exploratory investigations that can give direction to further, more in-depth studies.